
(NeurIPS 2017) Attention Is All You Need

Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. 2017: 6000-6010.



1. Overview


1.1. Motivation

  • dominant sequence transduction models are based on RNNs or CNNs
  • RNNs cannot be parallelized within a sequence, are computationally expensive for long sequences, and find it hard to learn long-range dependencies


This paper proposes the Transformer, a model architecture based solely on attention mechanisms

  • models dependencies without regard to their distance in the input or output sequences
  • attention is used to achieve global dependencies
  • reduces sequential computation
  • self-attention (intra-attention) relates different positions of a single sequence
  • related work: end-to-end memory networks, which are based on a recurrent attention mechanism



2. Methods


  • encoder. [x_1, …, x_n] → [z_1, …, z_n]
  • decoder. given [z_1, …, z_n], generates [y_1, …, y_m] one element at a time (auto-regressive)


2.1. Encoder

  • N = 6
  • d_model = 512
  • two sub-layers + residual connection + layer normalization (see the formula below)
    • multi-head self-attention
    • position-wise FC feed-forward network
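
Each sub-layer output follows the paper's residual-then-normalization pattern:

$$\mathrm{LayerNorm}\bigl(x + \mathrm{Sublayer}(x)\bigr)$$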

2.2. Decoder

  • N = 6
  • inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack

2.3. Attention



2.3.1. Scaled Dot-Product Attention
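
Scaled dot-product attention computes a weighted sum of the values, with the weight of each value given by the softmax of the scaled query–key dot products:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

The scaling factor $1/\sqrt{d_k}$ keeps the dot products from growing so large that the softmax saturates into regions with tiny gradients.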



2.3.2. Multi-Head Attention



  • h = 8
  • d_k = d_v = d_{model} / h = 64

The authors found it beneficial to linearly project the queries, keys, and values h times with different, learned linear projections to d_k, d_k, and d_v dimensions, respectively; a NumPy sketch is given below.
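
As an illustration only (not the paper's reference implementation), here is a minimal NumPy sketch of multi-head scaled dot-product attention using the base-model dimensions above; the projection matrices W_q, W_k, W_v, W_o are random placeholders standing in for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_q, n_k)
    return softmax(scores, axis=-1) @ V       # (n_q, d_v)

def multi_head_attention(X_q, X_kv, d_model=512, h=8, rng=np.random.default_rng(0)):
    d_k = d_v = d_model // h                  # 64 in the base model
    W_q = rng.normal(size=(h, d_model, d_k)) * 0.02   # placeholder learned projections
    W_k = rng.normal(size=(h, d_model, d_k)) * 0.02
    W_v = rng.normal(size=(h, d_model, d_v)) * 0.02
    W_o = rng.normal(size=(h * d_v, d_model)) * 0.02  # output projection after concat
    heads = [
        scaled_dot_product_attention(X_q @ W_q[i], X_kv @ W_k[i], X_kv @ W_v[i])
        for i in range(h)
    ]
    return np.concatenate(heads, axis=-1) @ W_o       # (n_q, d_model)

# toy usage: 10 query positions attending over 12 key/value positions
out = multi_head_attention(np.ones((10, 512)), np.ones((12, 512)))
print(out.shape)  # (10, 512)
```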

2.4. Applications of Attention in the Model

  • encoder-decoder attention. Q from previous decoder, K-V from final encoder
  • self-attention in encoder
  • self-attention in decoder

Mask out (set to −∞) all values that correspond to illegal connections, preventing leftward information flow in the decoder and preserving the auto-regressive property; a small masking sketch follows.
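
A toy NumPy illustration of that causal mask (in the model it is applied to the attention logits of the decoder's masked self-attention, before the softmax); the sequence length n here is arbitrary:

```python
import numpy as np

n = 4                                                 # toy sequence length
logits = np.zeros((n, n))                             # pretend attention scores (query x key)
illegal = np.triu(np.ones((n, n), dtype=bool), k=1)   # entries right of the diagonal = future positions
logits[illegal] = -np.inf                             # softmax will assign these zero weight
print(logits)
```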

2.5. Position-Wise FC



  • two linear transformations with a ReLU in between (see the formula below); equivalently, two convolutions with kernel size 1
  • inner-layer dimensionality d_{ff} = 2048
  • input/output dimensionality d_{model} = 512
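
Applied identically and independently at each position:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$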

2.6. Position Encoding

  • since there is no recurrence or convolution, information about token position must be injected
  • sine and cosine functions of different frequencies are used (formulas below)
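
The fixed sinusoidal encodings, where pos is the position and i the dimension index:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$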




3. Experiments


3.1. Details

  • sentences are encoded with byte-pair encoding
  • sentence pairs were batched together by approximate sequence length
  • Adam optimizer; the learning rate increases linearly over the first warmup_steps steps and then decreases proportionally to the inverse square root of the step number (schedule below)



  • Dropout 0.1.
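
The schedule referenced above, with warmup_steps = 4000 in the reported setup:

$$lrate = d_{model}^{-0.5} \cdot \min\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right)$$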

3.2. Comparison



3.3. Ablation Study



  • B. reducing the attention key size d_k hurts model quality
  • C, D. bigger models are better; dropout is helpful in avoiding over-fitting
  • E. learned positional embeddings give nearly identical results to the sinusoidal version